Multi-GPU #335

Merged

ngc92 merged 6 commits into main from ngc92/multi-gpu on Aug 27, 2025

Conversation

@ngc92 (Collaborator) commented Aug 23, 2025

Implement multi-GPU tasks.

Design: run_eval still calls only a single instance of eval.py, but if the task is multi-GPU, eval.py creates a pool of worker processes and calls the user op on each with a different rank argument.

In the current code, input generation and testing are also handled by each process independently; I'm not sure that's the right approach.
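A minimal, torch-free sketch of that design (the names `_user_op` and `run_task` are hypothetical, not the actual eval.py code): one worker process per rank, each invoked with a different rank argument.

```python
import multiprocessing as mp


def _user_op(rank: int, world_size: int) -> int:
    # stand-in for the user-submitted op; in eval.py each worker would
    # additionally bind to GPU `rank` and join a torch.distributed
    # process group before calling the real kernel
    return rank * 10 + world_size


def run_task(world_size: int) -> list:
    # pool of worker processes; each call gets its own rank argument
    with mp.Pool(processes=world_size) as pool:
        return pool.starmap(_user_op, [(r, world_size) for r in range(world_size)])
```

With this shape, the driver collects one result per rank, e.g. `run_task(4)` yields a list of four per-rank results.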

@Snektron mentioned this pull request Aug 25, 2025
import os
import torch.distributed as dist

world_size = test.args["world_size"]
# hard-coded rendezvous endpoint for the local process group
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "12356"
Member:

might want to set up a large random port, in case a job fails and the port doesn't get released
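One common way to avoid a stale hard-coded port (a sketch of the suggestion, not what the PR currently does) is to let the OS pick a free one and write that into MASTER_PORT before spawning workers:

```python
import socket


def find_free_port() -> int:
    # bind to port 0 and let the OS choose an unused TCP port;
    # note there is a small race window before the worker re-binds it
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

For example: `os.environ["MASTER_PORT"] = str(find_free_port())`.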

@msaroufim (Member) left a comment

This is a really cool PR; I don't think I've seen anyone recently manage process pools manually instead of just calling torchrun. I think this solution works quite nicely for us, though, because we don't care (yet) about multi-node or fault tolerance.

Idk if he has time, but just in case: @kiukchung, I would really appreciate your feedback on whether this is a reasonable way of supporting a job system where people submit distributed kernels.

The part I'm a bit worried about: the example right now can check that a kernel is globally correct by iterating over many ranks and making sure they're locally correct. That's probably not going to be true for any of the distributed AMD kernels. This is fine as long as we offload the burden of gathering all the outputs onto the user.

check_copy = _clone_data(data, rank)

# first, one obligatory correctness check
output = custom_kernel(_clone_data(data, rank))
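The cloning in this hunk guards against kernels that mutate their inputs; a torch-free sketch of the same pattern (the `check_once` driver and kernel signatures here are hypothetical, with `copy.deepcopy` standing in for `_clone_data`):

```python
import copy


def check_once(custom_kernel, ref_kernel, data):
    # keep a pristine copy so a kernel that mutates `data`
    # cannot corrupt the reference computation
    check_copy = copy.deepcopy(data)
    output = custom_kernel(copy.deepcopy(data))
    expected = ref_kernel(check_copy)
    return output == expected
```

Because each call gets its own clone, even an input-clobbering kernel is judged only on its returned output.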
@danielhua23 (Collaborator) commented Aug 26, 2025

I'd like to confirm: should users' custom_kernel be designed as an API that accepts and outputs single-rank data? My ref_kernel accepts all-rank data and outputs all-rank results. Is that a conflict, or does something need to change in my PR? https://github.com/gpu-mode/reference-kernels/pull/51/files#diff-4634bd7a4a47ab89859ee0db3f4f3f3c8123cf18981fb4d24b4655a412777013R240

Collaborator (Author):

I think, essentially, the user kernel should be what is called _worker in your code. I wanted to avoid passing tensor objects across process boundaries with the multiprocessing module; I see that in your example, you also transfer the tensor to the CPU before returning from _worker.

@danielhua23 (Collaborator) commented Aug 26, 2025

Yes, _worker is actually the kernel each rank runs. In your case, do you expect people to submit a custom_kernel that contains only the worker? Does that work with your infra? In other words, does submission.py (https://github.com/gpu-mode/reference-kernels/pull/51/files#diff-e7f1884ce42d6b30a70551a1a922b2547b8cb7b91e959a2ceb6f4471740dde95) work with your infra?

Collaborator:

Big thanks! I saw siro's commit to my PR, which is compatible with this PR.

@ngc92 (Collaborator, Author) commented Aug 26, 2025

> This is a really cool PR, I don't think I've seen anyone recently manually manage process pools as opposed to just calling torchrun. I think this solution works quite nicely for us though because we don't care (yet) about multi node or fault tolerance
>
> Idk if he has time but just in case @kiukchung would really appreciate your feedback on whether this is a reasonable way of supporting a job system where people submit distributed kernels
>
> The part I'm a bit worried about is the example right now can check that a kernel is globally correct by iterating over many ranks and making sure they're locally correct. That's probably not going to be true for any of the distributed AMD kernels. This is fine as long as we offload the burden of gathering all the outputs onto the user

My first attempt was to just torchrun eval.py, but that breaks our file-descriptor approach for result reporting. So I followed @danielhua23's example and did the process startup manually, which fit quite nicely with the existing code, which already had a process pool of one process anyway.

If you cannot do the check locally, there's nothing preventing us from having all-gather calls inside the validation function.
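The gather-then-validate idea can be sketched without torch (in real code this would be a dist.all_gather inside the validation function; everything named here, including `_local_output`, is hypothetical):

```python
import multiprocessing as mp


def _local_output(rank: int) -> int:
    # stand-in for one rank's kernel output
    return (rank + 1) ** 2

def validate_globally(world_size: int) -> bool:
    # gather every rank's output first, then run a single global check,
    # instead of requiring each rank to be checkable in isolation
    with mp.Pool(processes=world_size) as pool:
        gathered = pool.map(_local_output, range(world_size))
    # global property over all ranks: sum of squares 1..n
    n = world_size
    return sum(gathered) == n * (n + 1) * (2 * n + 1) // 6
```

The point of the sketch is only the shape: correctness is a property of the gathered, cross-rank result, not of any single rank's output.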

@ngc92 force-pushed the ngc92/multi-gpu branch 3 times, most recently from e5b7a0d to e2071bd on August 26, 2025 10:53
@kiukchung

> Idk if he has time but just in case @kiukchung would really appreciate your feedback on whether this is a reasonable way of supporting a job system where people submit distributed kernels

Managing your own processes is perfectly valid. torchrun handles things like error propagation, process-group gang semantics, better logging/console output, etc. That said, it was primarily built for training applications.

The use case here seems to be data-parallel eval on a single host (but multi-GPU), where the rank is used to partition the eval dataset, so perhaps gang semantics are not desired (e.g. torchrun will SIGTERM all workers if at least one worker fails, which is the behavior we want during training).

For completeness I'll mention that torchrun can be used programmatically (see: https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py#L927-L930):

from torch.distributed.launcher.api import elastic_launch, LaunchConfig

config = LaunchConfig(
    role=f"eval_{name_of_task}",  # name the worker procs (mostly informational)
    min_nodes=1,
    max_nodes=1,  # set both min and max to 1 in your case (single host)
    nproc_per_node=num_gpus,
    rdzv_endpoint="127.0.0.1",
    rdzv_backend="c10d",
    max_restarts=0,  # do not restart the gang (fail on the first failure)
)

elastic_launch(config, entrypoint=main_fn)(main_fn_args)

@ngc92 force-pushed the ngc92/multi-gpu branch 2 times, most recently from 71c40c1 to 54f7716 on August 26, 2025 23:11
@github-actions bot commented Aug 26, 2025

Coverage report

Files with new statements: src/libkernelbot/consts.py, src/libkernelbot/task.py (lines 119, 162 missing coverage)

This report was generated by python-coverage-comment-action

@ngc92 force-pushed the ngc92/multi-gpu branch 3 times, most recently from 776931b to 7f8c48f on August 26, 2025 23:24
@ngc92 changed the title from [WIP] Multi-GPU to Multi-GPU on Aug 27, 2025
@ngc92 ngc92 merged commit aa3e6ae into main Aug 27, 2025
5 checks passed
@ngc92 ngc92 deleted the ngc92/multi-gpu branch August 27, 2025 18:07
4 participants